S6 Structural Variant Calling
S6.1 Genome-wide Structural Variant Detection

We used whole-genome shotgun (WGS) paired-end sequence data generated on both the Illumina and Applied Biosystems SOLiD platforms from the genomes of six canid samples (including an additional Basenji sequenced only to low coverage on the Illumina platform, but excluding the Chinese wolf) to estimate the fraction of the genome made up of segmental duplications. Our goal was to identify potentially duplicated regions to filter out of the final SNP call set.

We identified the segmental duplication (SD) content of these genomes using the whole-genome shotgun sequence detection (WSSD) approach [1]. This strategy looks for regions with a significant excess of depth of coverage. Briefly, WGS reads are allowed to map to multiple locations in the reference genome, so reads from paralogous copies are expected to map to all of those locations. Highly identical duplicated regions are therefore detected as an excess of depth of coverage.

We used the dog assembly (canFam2) downloaded from the UCSC Genome Browser as the reference. Repeats detected by RepeatMasker and simple tandem repeats with a period smaller than 12 detected by Tandem Repeats Finder were masked before alignment. We aligned the Illumina reads with mrFAST v2.0.0.5 [2] at a 94% sequence identity threshold, and the SOLiD reads with drFAST v0.0.0.3 [3]. We calculated absolute copy numbers in non-overlapping windows of 1 kb of unmasked sequence using mrCaNaVaR version 0.31 (http://mrcanavar.sourceforge.net/), and identified SDs as regions with at least 5 consecutive windows with a copy number higher than 2.5.

We detected between 1,379 and 1,413 SD segments larger than 10 kb in the five genomes we analyzed. These regions comprise 52.77 to 55.01 Mb in total, corresponding to 2.09% to 2.17% of the reference assembly (Table S6.1.1).
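The SD calling rule described above (at least 5 consecutive 1 kb windows with a copy number higher than 2.5) can be illustrated with a minimal Python sketch. This is not mrCaNaVaR's implementation: the `Window` and `Segment` types and the `call_sd_segments` function are hypothetical names introduced here, and the sketch assumes the windows are already sorted by chromosome and position, so that neighbours in the list are consecutive windows of unmasked sequence.

```python
from typing import Iterable, List, NamedTuple


class Window(NamedTuple):
    """One copy-number window (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    copy_number: float


class Segment(NamedTuple):
    """A run of consecutive windows called as a putative SD."""
    chrom: str
    start: int
    end: int
    n_windows: int


def call_sd_segments(
    windows: Iterable[Window],
    cn_threshold: float = 2.5,  # copy number must be strictly higher than this
    min_windows: int = 5,       # minimum number of consecutive windows
) -> List[Segment]:
    """Return runs of >= min_windows consecutive windows above cn_threshold.

    Assumes windows are sorted by chromosome, then by start position.
    """
    segments: List[Segment] = []
    run: List[Window] = []

    def flush() -> None:
        # Emit the current run if it is long enough, then reset it.
        if len(run) >= min_windows:
            segments.append(
                Segment(run[0].chrom, run[0].start, run[-1].end, len(run))
            )
        run.clear()

    for w in windows:
        # A run never spans two chromosomes.
        if run and w.chrom != run[-1].chrom:
            flush()
        if w.copy_number > cn_threshold:
            run.append(w)
        else:
            flush()
    flush()  # close a run that reaches the end of the input
    return segments


# Toy input: one run of five elevated windows, then a run of only two.
copy_numbers = [2.0, 3.1, 3.0, 2.9, 3.4, 2.8, 2.0, 3.0, 3.0]
toy = [
    Window("chr1", i * 1000, (i + 1) * 1000, cn)
    for i, cn in enumerate(copy_numbers)
]
calls = call_sd_segments(toy)
print(calls)
```

On the toy input only the five-window run is reported; the trailing two-window run is too short. Because windows cover 1 kb of unmasked sequence, the genomic span of a called segment can exceed the window count times 1 kb, so a length filter such as the 10 kb cutoff used for reporting would be applied to the segment coordinates as a separate step.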